[LinearLayouts] Faster pext algorithm #5621

Merged
merged 4 commits into from
Jan 15, 2025

Conversation

lezcano
Contributor

@lezcano lezcano commented Jan 15, 2025

We also skip the LinearLayout test for HIP as it's currently failing.

Regarding the use of getWarpSize and getNumWarpsPerCTA, which are not correct for LinearLayouts with broadcasting as noted in #5617: we found that almost all of the uses are in AMD land. Changing these to call the functions that act on the module is tricky, as the module is not currently accessible at most of the call sites. As such, we leave this refactor up to the AMD folks.

Comment on lines +2796 to +2797
if is_hip() and isinstance(src_layout, LinearLayout):
    pytest.skip("FIXME: LinearLayout not supported on HIP")
Contributor Author

@lezcano lezcano Jan 15, 2025

cc @antiagainst regarding the HIP skip. See also the warpSize / numWarp comment in the OP.

@lezcano lezcano enabled auto-merge (squash) January 15, 2025 17:48
@lezcano lezcano merged commit 9895a1f into main Jan 15, 2025
7 checks passed
@lezcano lezcano deleted the reviews_reduce_linear branch January 15, 2025 17:59
lezcano pushed a commit that referenced this pull request Jan 31, 2025
This PR fixes a typo in the Windows implementation of `__builtin_clz`
that was introduced in #5621.

According to [this in-code
comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12)
these Windows implementations should have been copied from [this gist
snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0).
In the snippet, however, the `clz` implementation additionally [XORs the
result of
`_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53)
with 31 to convert the <i>index of the most significant set bit</i>
produced by `_BitScanReverse` into the expected <i>number of leading
zeros</i>. I believe the implementation was copied into Triton without
the final XOR by accident.

<b>What is affected by this error?</b>
This implementation of CLZ is used in
[`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635),
which is used in
[`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662),
which in turn is used by the
[`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247)
pattern. This bug caused `tt.reduce()` ops to be lowered incorrectly on
Windows in cases where shared memory is needed to store temporary
reduced results.

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
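
To make the missing XOR concrete, here is a minimal sketch of the difference between the two variants. It is not the code from Triton or from the gist: `msb_index` is a hypothetical, portable stand-in for what `_BitScanReverse` reports (the index of the highest set bit), and `clz_missing_xor` / `clz_fixed` are illustrative names. For an index in the range 0..31, `index ^ 31` equals `31 - index`, which is the number of leading zeros.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the index reported by MSVC's _BitScanReverse:
// the position (0..31) of the most significant set bit of x.
inline uint32_t msb_index(uint32_t x) {
  assert(x != 0 && "undefined for zero, like __builtin_clz");
  uint32_t index = 0;
  while (x >>= 1)
    ++index;
  return index;
}

// Variant without the final XOR: returns the bit index itself,
// which is not the number of leading zeros.
inline uint32_t clz_missing_xor(uint32_t x) { return msb_index(x); }

// Variant with the final XOR: for index in 0..31, index ^ 31 == 31 - index,
// which is the number of leading zeros in a 32-bit value.
inline uint32_t clz_fixed(uint32_t x) { return msb_index(x) ^ 31u; }
```

For example, with the input `0x00010000` (only bit 16 set) the first variant returns 16, while the second returns 15, the actual count of leading zeros.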
anmyachev pushed a commit to intel/intel-xpu-backend-for-triton that referenced this pull request Jan 31, 2025
Closes #3273

A recent upstream PR (triton-lang/triton#5621) introduced new,
faster logic for `pext_i32`, which is used in the `ReduceOpToLLVM` pattern.
The new `pext_i32` logic uses the `__builtin_clz` intrinsic, which is
natively available in GCC and Clang but missing in MSVC. It seems
that the Windows version of this intrinsic was copied incorrectly from
[the given
source](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L44-L55):
it is missing the final `r ^ 31`, which causes the `tt.reduce(...)`
lowering to produce incorrect LLVM IR in some scenarios.

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
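
For readers unfamiliar with the operation, the following is a naive bit-by-bit reference for a 32-bit parallel bit extract (pext). It is only an illustration of what the operation computes: it is not the faster algorithm from the upstream PR, it does not use `__builtin_clz`, and the name `pext_reference` is made up for this sketch.

```cpp
#include <cstdint>

// Reference parallel bit extract: collect the bits of `value` at the
// positions where `mask` has a 1, and pack them contiguously into the
// low bits of the result, preserving their relative order.
inline uint32_t pext_reference(uint32_t value, uint32_t mask) {
  uint32_t result = 0;
  uint32_t out_bit = 1; // next output position to fill
  while (mask != 0) {
    uint32_t lowest = mask & (~mask + 1u); // isolate the lowest set mask bit
    if (value & lowest)
      result |= out_bit;
    out_bit <<= 1;
    mask &= mask - 1u; // clear the lowest set mask bit
  }
  return result;
}
```

With `value = 0xB4` and `mask = 0xF0`, the four selected bits are `1011`, so the result is `0xB`. A simple loop like this can serve as an oracle when testing a faster implementation on each platform.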
AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Feb 6, 2025
…5774)
